Aim
This project builds machine learning models from features computed from digitised fine needle aspirate (FNA) images to predict whether a breast tumour is benign or malignant, and selects the best-performing model for this dataset.
Dataset: https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data/data
UCI: https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic
Import libraries and load data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import (train_test_split, StratifiedShuffleSplit,
                                     KFold, StratifiedKFold, GridSearchCV,
                                     cross_val_score)
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB  # used in the Naive Bayes section; missing from the original imports
from sklearn.metrics import (accuracy_score, confusion_matrix, classification_report,
                             roc_curve, auc, mean_squared_error)
breast_data = pd.read_csv('data.csv')
breast_data.head()
| id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | ... | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst | Unnamed: 32 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842302 | M | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.3001 | 0.14710 | ... | 17.33 | 184.60 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.11890 | NaN |
| 1 | 842517 | M | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | ... | 23.41 | 158.80 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.1860 | 0.2750 | 0.08902 | NaN |
| 2 | 84300903 | M | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.1974 | 0.12790 | ... | 25.53 | 152.50 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.2430 | 0.3613 | 0.08758 | NaN |
| 3 | 84348301 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.2414 | 0.10520 | ... | 26.50 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.17300 | NaN |
| 4 | 84358402 | M | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.1980 | 0.10430 | ... | 16.67 | 152.20 | 1575.0 | 0.1374 | 0.2050 | 0.4000 | 0.1625 | 0.2364 | 0.07678 | NaN |
5 rows × 33 columns
breast_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 33 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   id                       569 non-null    int64
 1   diagnosis                569 non-null    object
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non-null    float64
 14  perimeter_se             569 non-null    float64
 15  area_se                  569 non-null    float64
 16  smoothness_se            569 non-null    float64
 17  compactness_se           569 non-null    float64
 18  concavity_se             569 non-null    float64
 19  concave points_se        569 non-null    float64
 20  symmetry_se              569 non-null    float64
 21  fractal_dimension_se     569 non-null    float64
 22  radius_worst             569 non-null    float64
 23  texture_worst            569 non-null    float64
 24  perimeter_worst          569 non-null    float64
 25  area_worst               569 non-null    float64
 26  smoothness_worst         569 non-null    float64
 27  compactness_worst        569 non-null    float64
 28  concavity_worst          569 non-null    float64
 29  concave points_worst     569 non-null    float64
 30  symmetry_worst           569 non-null    float64
 31  fractal_dimension_worst  569 non-null    float64
 32  Unnamed: 32              0 non-null      float64
dtypes: float64(31), int64(1), object(1)
memory usage: 146.8+ KB
Data cleaning
breast_data.drop(columns=['id', 'Unnamed: 32'], inplace=True)  # `axis` is redundant when `columns` is given
missing_values = breast_data.isnull().sum()
print("Missing Values:\n", missing_values)
Missing Values:
diagnosis                  0
radius_mean                0
texture_mean               0
perimeter_mean             0
area_mean                  0
smoothness_mean            0
compactness_mean           0
concavity_mean             0
concave points_mean        0
symmetry_mean              0
fractal_dimension_mean     0
radius_se                  0
texture_se                 0
perimeter_se               0
area_se                    0
smoothness_se              0
compactness_se             0
concavity_se               0
concave points_se          0
symmetry_se                0
fractal_dimension_se       0
radius_worst               0
texture_worst              0
perimeter_worst            0
area_worst                 0
smoothness_worst           0
compactness_worst          0
concavity_worst            0
concave points_worst       0
symmetry_worst             0
fractal_dimension_worst    0
dtype: int64
duplicate_rows = breast_data.duplicated()
print("Number of duplicate rows:", duplicate_rows.sum())
Number of duplicate rows: 0
# Summary Statistics for Features
feature_summary = breast_data.drop('diagnosis', axis=1).describe()
feature_summary
| radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | symmetry_mean | fractal_dimension_mean | ... | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | ... | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 | 569.000000 |
| mean | 14.127292 | 19.289649 | 91.969033 | 654.889104 | 0.096360 | 0.104341 | 0.088799 | 0.048919 | 0.181162 | 0.062798 | ... | 16.269190 | 25.677223 | 107.261213 | 880.583128 | 0.132369 | 0.254265 | 0.272188 | 0.114606 | 0.290076 | 0.083946 |
| std | 3.524049 | 4.301036 | 24.298981 | 351.914129 | 0.014064 | 0.052813 | 0.079720 | 0.038803 | 0.027414 | 0.007060 | ... | 4.833242 | 6.146258 | 33.602542 | 569.356993 | 0.022832 | 0.157336 | 0.208624 | 0.065732 | 0.061867 | 0.018061 |
| min | 6.981000 | 9.710000 | 43.790000 | 143.500000 | 0.052630 | 0.019380 | 0.000000 | 0.000000 | 0.106000 | 0.049960 | ... | 7.930000 | 12.020000 | 50.410000 | 185.200000 | 0.071170 | 0.027290 | 0.000000 | 0.000000 | 0.156500 | 0.055040 |
| 25% | 11.700000 | 16.170000 | 75.170000 | 420.300000 | 0.086370 | 0.064920 | 0.029560 | 0.020310 | 0.161900 | 0.057700 | ... | 13.010000 | 21.080000 | 84.110000 | 515.300000 | 0.116600 | 0.147200 | 0.114500 | 0.064930 | 0.250400 | 0.071460 |
| 50% | 13.370000 | 18.840000 | 86.240000 | 551.100000 | 0.095870 | 0.092630 | 0.061540 | 0.033500 | 0.179200 | 0.061540 | ... | 14.970000 | 25.410000 | 97.660000 | 686.500000 | 0.131300 | 0.211900 | 0.226700 | 0.099930 | 0.282200 | 0.080040 |
| 75% | 15.780000 | 21.800000 | 104.100000 | 782.700000 | 0.105300 | 0.130400 | 0.130700 | 0.074000 | 0.195700 | 0.066120 | ... | 18.790000 | 29.720000 | 125.400000 | 1084.000000 | 0.146000 | 0.339100 | 0.382900 | 0.161400 | 0.317900 | 0.092080 |
| max | 28.110000 | 39.280000 | 188.500000 | 2501.000000 | 0.163400 | 0.345400 | 0.426800 | 0.201200 | 0.304000 | 0.097440 | ... | 36.040000 | 49.540000 | 251.200000 | 4254.000000 | 0.222600 | 1.058000 | 1.252000 | 0.291000 | 0.663800 | 0.207500 |
8 rows × 30 columns
Data visualization -
Diagnosis distribution: benign vs. malignant
diagnosis_distribution = breast_data['diagnosis'].value_counts().reset_index()
diagnosis_distribution.columns = ['Diagnosis', 'Count']
# Assigning colors to each diagnosis category
colors = {'M': 'darkred', 'B': 'steelblue'}
fig = px.bar(diagnosis_distribution, x='Diagnosis', y='Count', color='Diagnosis',
color_discrete_map=colors, title='Distribution of Malignant (M) and Benign (B) Diagnoses',
labels={'Diagnosis': 'Diagnosis', 'Count': 'Count'})
# Customize the layout
fig.update_layout(showlegend=False) # Hide legend for better aesthetics
fig.update_traces(marker_line_width=0) # Remove border around bars for a cleaner look
fig.show()
Heatmap
relationship = breast_data.columns
plt.figure(figsize=(20, 15))
sns.heatmap(breast_data[relationship[1:]].corr(), annot=True, fmt=".2f")
plt.show()
Distribution of features
num_features = breast_data.drop('diagnosis', axis=1)  # was `df`, which is undefined
plt.figure(figsize=(16, 10))
for i, feature in enumerate(num_features.columns, 1):
    plt.subplot(5, 6, i)
    sns.histplot(num_features[feature], kde=True)
    plt.title(f'Distribution of {feature}')
plt.tight_layout()
plt.show()
selected_features = ['concave points_worst', 'perimeter_worst', 'concave points_mean', 'radius_worst',
'perimeter_mean', 'area_worst', 'radius_mean', 'area_mean', 'concavity_mean', 'compactness_mean']
selected_features.append('diagnosis')
selected_breast_data = breast_data[selected_features]
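SelectKBest and chi2 are imported at the top but never used; the hand-picked feature list above could instead be derived from a univariate ranking. A minimal sketch, using sklearn's bundled copy of the same dataset so it runs without data.csv (the feature names there are spelled slightly differently, e.g. `worst concave points` rather than `concave points_worst`):

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# chi2 requires non-negative features, which holds for these measurements
selector = SelectKBest(score_func=chi2, k=10)
selector.fit(X, y)

# Rank all 30 features by chi2 score, highest first
ranking = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)
print(ranking.head(10))
```

Note that chi2 scores are scale-sensitive, so area/perimeter features dominate; a mutual-information or correlation-based ranking on scaled data would be an alternative.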
Train/test set splitting
Holding out a test set lets us measure how accurately the model predicts outcomes on new, unseen data rather than just on the data it was trained on.
# splitting data
X_train, X_test, y_train, y_test = train_test_split(
breast_data.drop('diagnosis', axis=1),
breast_data['diagnosis'],
test_size=0.2,
random_state=42)
print("Shape of training set:", X_train.shape)
print("Shape of test set:", X_test.shape)
Shape of training set: (455, 30) Shape of test set: (114, 30)
The training set (X_train and y_train) is used to train the machine learning model to learn the patterns and relationships within the data.
The test set (X_test and y_test) is then used to evaluate the model's performance by making predictions on unseen data. The difference between the predicted labels and the actual labels in the test set indicates how well the model generalizes to new, unseen instances.
Data scaling
Scaling puts all features on a comparable range, which improves the performance and stability of distance- and gradient-based models and helps them generalise across datasets.
# scaling data
ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)  # reuse the training-set mean/std; re-fitting on the test set leaks information
StandardScaler standardizes each feature by subtracting its mean and dividing by its standard deviation, giving zero mean and unit variance.
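A quick check that StandardScaler's output matches the formula z = (x - mean) / std, on a tiny made-up array rather than the notebook's training data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])

scaled = StandardScaler().fit_transform(X)
manual = (X - X.mean(axis=0)) / X.std(axis=0)  # population std (ddof=0), matching sklearn

print(np.allclose(scaled, manual))  # → True
```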
KNeighbors Classifier -
The n_neighbors argument sets how many neighbours are considered when classifying a point. KNN is non-parametric and can capture intricate, non-linear decision boundaries, but its performance depends on the choice of distance metric and on proper feature scaling.
# find which K value gives the lowest mean error
error_rate = []
for i in range(1, 42):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    pred_i = knn.predict(X_test)
    error_rate.append(np.mean(pred_i != y_test))
plt.figure(figsize=(12, 6))
plt.plot(range(1, 42), error_rate, color='purple', linestyle='--',
         marker='o', markersize=10, markerfacecolor='b')
plt.title('Error Rate vs K-value')
plt.show()
knn = KNeighborsClassifier(n_neighbors=9)
knn.fit(X_train, y_train)
prediction1 = knn.predict(X_test)
print(confusion_matrix(y_test, prediction1))
print("\n")
print(classification_report(y_test, prediction1))
[[70 1]
[ 4 39]]
precision recall f1-score support
B 0.95 0.99 0.97 71
M 0.97 0.91 0.94 43
accuracy 0.96 114
macro avg 0.96 0.95 0.95 114
weighted avg 0.96 0.96 0.96 114
From the classification report one can see the model's strengths and weaknesses in predicting benign and malignant cases.
knn_model_acc = accuracy_score(y_test, prediction1)
print("Accuracy of K Neighbors Classifier Model is: ", knn_model_acc)
Accuracy of K Neighbors Classifier Model is: 0.956140350877193
KNeighborsClassifier_cross_val = cross_val_score(KNeighborsClassifier(), X_train, y_train)
print("Cross validation score of KNeighborsClassifier Model:")
count = 0
for i in KNeighborsClassifier_cross_val:
    count += 1
    print(f"{count}) {round(i * 100, ndigits=2)} %")
Cross validation score of KNeighborsClassifier Model:
1) 96.7 %
2) 95.6 %
3) 98.9 %
4) 96.7 %
5) 92.31 %
Cross-validation gives a more reliable estimate of the model's performance and guards against overfitting to one particular train-test split. This matters in healthcare applications, where precise predictions on new, unseen patient data depend on the model's ability to generalise.
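StratifiedKFold is imported at the top but never used; a sketch of how it could make the folds explicit, keeping the M/B class ratio roughly constant in every fold and wrapping the scaler in a pipeline so each fold is scaled with its own training statistics (n_neighbors=9 mirrors the value chosen above; sklearn's bundled dataset stands in for data.csv):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Scaling inside the pipeline is re-fit on each training fold, so no
# test-fold statistics leak into training
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=9))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv)

print(scores.round(3), round(scores.mean(), 3))
```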
Linear Regression model from scratch -
Linear regression is not a natural classifier, but it can be adapted to breast cancer prediction: fit a linear relationship between the measured cell features and the 0/1 diagnosis label, then threshold the continuous output to obtain a class.
from sklearn.datasets import load_breast_cancer
# Load breast cancer dataset
data = load_breast_cancer()
X = data.data[:, 0].reshape(-1, 1) # Using only one feature for simplicity
y = data.target.reshape(-1, 1)
# Linear Regression Implementation (gradient descent)
class LinearRegression:
    def __init__(self, learning_rate=0.01, n_iterations=1000):
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations
        self.weights = None
        self.bias = None

    def fit(self, X, y):
        # Add a bias column of ones to the input (this already encodes the
        # intercept, so the separate self.bias term below is redundant but harmless)
        X_bias = np.c_[np.ones((X.shape[0], 1)), X]
        # Initialize weights and bias
        self.weights = np.random.randn(X_bias.shape[1], 1)
        self.bias = np.zeros((1, 1))
        for _ in range(self.n_iterations):
            # Compute predictions
            predictions = np.dot(X_bias, self.weights) + self.bias
            # Compute gradients of the mean squared error
            dw = (1 / X_bias.shape[0]) * np.dot(X_bias.T, (predictions - y))
            db = (1 / X_bias.shape[0]) * np.sum(predictions - y)
            # Update weights and bias by gradient descent
            self.weights -= self.learning_rate * dw
            self.bias -= self.learning_rate * db

    def predict(self, X):
        X_bias = np.c_[np.ones((X.shape[0], 1)), X]
        return np.dot(X_bias, self.weights) + self.bias
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create and train the Linear Regression model
linear_reg_model = LinearRegression()
linear_reg_model.fit(X_train, y_train)
# Make predictions (predict() adds the bias column internally)
predictions = linear_reg_model.predict(X_test)
# Plot the data and the linear regression line
plt.scatter(X_test[:, 0], y_test[:, 0], label='Data')
plt.plot(X_test[:, 0], predictions[:, 0], label='Linear Regression', color='red')
plt.xlabel('Feature')
plt.ylabel('Target')
plt.legend()
plt.show()
from sklearn.linear_model import LinearRegression  # note: this shadows the from-scratch LinearRegression class defined above
# Load breast cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target
# load_breast_cancer's target is already binary (0 = malignant, 1 = benign), so this cast is a no-op
y_binary = (y > 0).astype(int)
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y_binary, test_size=0.2, random_state=42)
# Add a bias column to the input (redundant here: sklearn's LinearRegression
# fits an intercept by default, but the extra ones column is harmless)
X_train_bias = np.c_[np.ones((X_train.shape[0], 1)), X_train]
X_test_bias = np.c_[np.ones((X_test.shape[0], 1)), X_test]
# Create and train the Linear Regression model
linear_reg_model = LinearRegression()
linear_reg_model.fit(X_train_bias, y_train)
# Make predictions
predictions = linear_reg_model.predict(X_test_bias)
# Convert predictions to binary labels (0 or 1)
predictions_binary = (predictions > 0.5).astype(int)
# Display classification report
print("Classification Report:\n", classification_report(y_test, predictions_binary))
Classification Report:
precision recall f1-score support
0 0.95 0.88 0.92 43
1 0.93 0.97 0.95 71
accuracy 0.94 114
macro avg 0.94 0.93 0.93 114
weighted avg 0.94 0.94 0.94 114
# Linear Regression Implementation (normal equation)
class LinearRegressionModel:
    def __init__(self):
        self.weights = None
        self.bias = None

    def fit(self, X, y):
        # Add a bias column to the input
        X_bias = np.c_[np.ones((X.shape[0], 1)), X]
        # Compute the weights using the normal equation; pinv is more stable
        # than inv when X^T X is ill-conditioned (highly correlated features)
        theta = np.linalg.pinv(X_bias.T @ X_bias) @ X_bias.T @ y
        # Extract weights and bias
        self.bias = theta[0]
        self.weights = theta[1:]

    def predict(self, X):
        X_bias = np.c_[np.ones((X.shape[0], 1)), X]
        return X_bias @ np.concatenate(([self.bias], self.weights))
# Create and train the Linear Regression model
linear_reg_model = LinearRegressionModel()
linear_reg_model.fit(X_train, y_train)
# Make predictions
predictions = linear_reg_model.predict(X_test)
# Calculate Mean Squared Error for evaluation
mse = mean_squared_error(y_test, predictions)
print("Mean Squared Error:", mse)
# Perform k-fold cross-validation
def k_fold_cross_validation(model, X, y, k=5):
    kf = KFold(n_splits=k, shuffle=True, random_state=42)
    mse_scores = []
    for train_index, test_index in kf.split(X):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        model.fit(X_train, y_train)
        predictions = model.predict(X_test)
        mse_scores.append(mean_squared_error(y_test, predictions))
    return np.mean(mse_scores)

# Use k-fold cross-validation with the Linear Regression model
mse_cv = k_fold_cross_validation(linear_reg_model, X, y)
print(f'Mean Squared Error (Cross-Validation): {mse_cv}')
Mean Squared Error: 0.06410886246958959
Mean Squared Error (Cross-Validation): 0.0626649678814163
Naive Bayes -
To predict whether a tumour is likely benign or malignant, Naive Bayes combines the per-class likelihoods of each feature under the assumption that features are conditionally independent given the class label.
from sklearn.naive_bayes import GaussianNB  # not imported in the original import cell
nb = GaussianNB()  # build the model
nb.fit(X_train, y_train)  # train the model
print("Accuracy of the Naive Bayes model: {}".format(nb.score(X_test, y_test)))
nb_acc_score = nb.score(X_test, y_test)
Accuracy of the Naive Bayes model: 0.956140350877193
y_pred = nb.predict(X_test)
y_true = y_test
cm = confusion_matrix(y_true, y_pred)
#visualize
f, ax = plt.subplots(figsize=(5,5))
sns.heatmap(cm,annot = True, linewidths=0.5,linecolor="red",fmt = ".0f",ax=ax)
plt.xlabel("y_pred")
plt.ylabel("y_true")
plt.show()
print(confusion_matrix(y_test,y_pred))
print("\n")
print(classification_report(y_test, y_pred))
[[70 1]
[ 4 39]]
precision recall f1-score support
B 0.95 0.99 0.97 71
M 0.97 0.91 0.94 43
accuracy 0.96 114
macro avg 0.96 0.95 0.95 114
weighted avg 0.96 0.96 0.96 114
GaussianNB_cross_val = cross_val_score(GaussianNB(), X_train, y_train)
print("Cross validation score of Bayesian classification Model:")
count = 0
for i in GaussianNB_cross_val:
    count += 1
    print(f"{count}) {round(i * 100, ndigits=2)} %")
Cross validation score of Bayesian classification Model:
1) 90.11 %
2) 96.7 %
3) 93.41 %
4) 93.41 %
5) 93.41 %
Logistic Regression -
Logistic regression is widely used in medicine for tasks like cancer prediction because it models the probability of a binary outcome, which maps directly onto a benign/malignant decision.
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
predictions1 = logreg.predict(X_test)
print("Confusion Matrix: \n", confusion_matrix(y_test, predictions1))
print('\n')
print(classification_report(y_test, predictions1))
Confusion Matrix:
[[71 0]
[ 2 41]]
precision recall f1-score support
B 0.97 1.00 0.99 71
M 1.00 0.95 0.98 43
accuracy 0.98 114
macro avg 0.99 0.98 0.98 114
weighted avg 0.98 0.98 0.98 114
logreg_acc = accuracy_score(y_test, predictions1)
print("Accuracy of the Logistic Regression Model is: ", logreg_acc)
Accuracy of the Logistic Regression Model is: 0.9824561403508771
LogisticRegression_cross_val = cross_val_score(LogisticRegression(), X_train, y_train)
print("Cross validation score of Logistic Regression Model:")
count = 0
for i in LogisticRegression_cross_val:
    count += 1
    print(f"{count}) {round(i * 100, ndigits=2)} %")
Cross validation score of Logistic Regression Model:
1) 97.8 %
2) 96.7 %
3) 100.0 %
4) 97.8 %
5) 94.51 %
Conclusion -
The best ML model for this dataset is determined by comparing each model's test-set accuracy, as reported above:
Linear regression (thresholded): 0.94
Naive Bayes: 0.956
K Neighbors Classifier: 0.956
Logistic Regression: 0.982
Hence, the best model for this dataset is logistic regression.
This model can be used to predict whether a breast cancer diagnosis is benign or malignant.
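To close the loop, a sketch of how the winning model would score a new, unseen sample. It trains on sklearn's bundled copy of the dataset so the snippet is self-contained; the notebook's own `logreg` and `ss` objects would be used the same way (scale first, then predict):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

ss = StandardScaler().fit(X_train)            # fit the scaler on training data only
clf = LogisticRegression(max_iter=1000)
clf.fit(ss.transform(X_train), y_train)

new_sample = ss.transform(X_test[:1])         # scale with the training statistics
print(clf.predict(new_sample), clf.predict_proba(new_sample).round(3))
```

predict_proba gives the class probabilities behind the hard label, which is often more useful clinically than the label alone.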